- Regression examples
- Classification example
- Measuring accuracy
- Bias-variance tradeoff
9/8/2020
So far, we learned how to visualize and create numerical summaries of data.
During this course, we will go further: we want to fit an explicit equation, called a
regression model, that describes how one variable (the
response \(y\)) changes as a function of other variables (the
predictors \(X\)).
This is called regression, curve fitting or supervised learning: estimating a best guess for \(y\) given \(X\), i.e. the conditional expected value \(\mathbb{E}[y \mid X]\).
For example, you might have heard the following rule of thumb: to calculate your maximum heart rate, subtract your age from \(220\).
We can express this rule as an equation: \[\text{MHR} = 220 - \text{Age}\]
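As a minimal sketch (the function name is ours, not the source's), the rule is a one-line computation:

```python
def mhr_rule_of_thumb(age):
    """Classic rule of thumb: maximum heart rate = 220 - age (in bpm)."""
    return 220 - age

print(mhr_rule_of_thumb(30))  # -> 190
```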
This equation comes from data. The study was probably along these lines: measure the maximum heart rate of many people of different ages, then fit a line through the results. Data from such a study would look like a table of (Age, MHR) pairs.
It turns out that a second equation, \(\text{MHR} = 208 - 0.7 \times \text{Age}\), is better: it makes smaller errors, on average.
Goals: given training data \(\{(x_1,y_1),\dots,(x_n,y_n)\}\), learn a function \(\hat{f}\) such that \(\hat{f}(x)\) is a good predictor of \(y\) for new observations.
Alice is 28. What is her predicted max heart rate?
Our equation expresses the conditional expected value of MHR, given a known value of age:
\[\mathbb{E}[\text{MHR} \mid \text{Age} = 28] = 208 - 0.7 \times 28 = 188.4\]
This is our best guess without actually putting Alice on a treadmill test until she vomits.
How does max heart rate change with age?
\[\mathbb{E}[\text{MHR} \mid \text{Age}] = 208 - 0.7 \times \text{Age}\]
So the expected maximum heart rate is about 0.7 bpm lower, on average, for every additional year of age.
There is no guarantee that your MHR will decline at this rate; it is just a population-level average.
A common use of regression models is to make fair comparisons by adjusting for the systematic effect of some common variable.
In this case, we can adjust for how high we expect each person's MHR to be, given their age.
Let us compare two people whose max heart rates are measured using an actual treadmill test:
Clearly, Alice has a higher MHR, but let’s make things fair. We need to give Abigail a head start, since max heart rate declines with age.
So, who has a higher maximum heart rate for their age? Key idea: compare actual MHR with expected MHR.
Alice’s actual MHR is 185, versus an expected MHR of 188.4
\[\begin{split} \text{Actual} - \text{Predicted} &= 185 - (208 - 0.7 \times 28) \\ &= 185 - 188.4 = -3.4 \end{split}\]
Abigail’s actual MHR is 174, versus an expected MHR of 169.5
\[\begin{split} \text{Actual} - \text{Predicted} &= 174 - (208 - 0.7 \times 55) \\ &= 174 - 169.5 = 4.5 \end{split}\]
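The two comparisons above can be reproduced with a short sketch, using the model \(\text{MHR} = 208 - 0.7 \times \text{Age}\) from the text:

```python
def predicted_mhr(age):
    """Expected maximum heart rate given age, per the fitted model."""
    return 208 - 0.7 * age

def residual(actual_mhr, age):
    """Actual minus predicted MHR: positive means above expectation for one's age."""
    return actual_mhr - predicted_mhr(age)

# Alice: 28 years old, measured MHR of 185
print(round(residual(185, 28), 1))  # -> -3.4
# Abigail: 55 years old, measured MHR of 174
print(round(residual(174, 55), 1))  # -> 4.5
```

So Abigail, despite the lower raw number, has the higher maximum heart rate for her age.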
The goals are the same as before.
We can make an assumption about the functional form of \(f(x)\).
Linear model for \(p\) predictors: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} = \beta_0 + \sum_{j = 1}^{p} \beta_j x_{ij} = \mathbf{x}_{i}^{\intercal} \boldsymbol{\beta},\] where the first entry of \(\mathbf{x}_i\) is set to \(1\) so that the intercept \(\beta_0\) is absorbed into \(\boldsymbol{\beta}\).
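A minimal sketch of fitting such a model, assuming ordinary least squares as the criterion and made-up numbers for the data:

```python
import numpy as np

# Made-up data: n = 5 observations, p = 2 predictors
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([5.1, 5.9, 8.2, 11.8, 12.9])

# Prepend a column of ones so the intercept beta_0 is part of the coefficient vector
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary least-squares estimate of the coefficients
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta  # fitted values x_i^T beta
print(np.round(beta, 2))
```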
Let’s see an example of a “nonparametric” model.
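As one concrete illustration (assumed here; the slide's own example may differ), k-nearest-neighbours regression predicts at a point \(x_0\) by averaging the responses of the \(k\) training points closest to it, without assuming any functional form for \(f\):

```python
import numpy as np

def knn_predict(x0, x_train, y_train, k=3):
    """k-nearest-neighbours regression: average y over the k training
    points whose x is closest to the query point x0."""
    dist = np.abs(x_train - x0)      # distances to every training x
    nearest = np.argsort(dist)[:k]   # indices of the k closest points
    return y_train[nearest].mean()   # local average, no model for f

x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(knn_predict(3.1, x_train, y_train, k=3))  # averages y at x = 2, 3, 4
```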
Here the response variable \(Y\) is qualitative, e.g. e-mail is one of \(\mathcal{C} = \{\text{spam}, \text{not spam}\}\), digit class is one of \(\mathcal{C} = \{0, 1, \dots, 9\}\).
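A natural measure of accuracy for a qualitative response is the misclassification rate, the fraction of observations for which the predicted class label is wrong; a minimal sketch with made-up labels:

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class label is wrong."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

y_true = ["spam", "not spam", "spam", "not spam", "spam"]
y_pred = ["spam", "spam",     "spam", "not spam", "not spam"]
print(error_rate(y_true, y_pred))  # 2 mistakes out of 5 -> 0.4
```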
Our goals are to build a classifier \(C(x)\) that assigns a class label to a future observation, and to assess its accuracy.
Using the training data, a standard measure of accuracy is the mean squared error \[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n}\left\{y_i - \hat{f}(x_i) \right\}^2.\] This measure tells us how large the “mistakes” (errors) made by the model are, on average.
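The MSE formula translates directly into code; a minimal sketch with made-up numbers:

```python
def mse(y, y_hat):
    """Mean squared error: average of the squared residuals y_i - f_hat(x_i)."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)

y     = [3.0, 5.0, 7.0]
y_hat = [2.5, 5.5, 7.0]
print(mse(y, y_hat))  # (0.25 + 0.25 + 0) / 3
```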
\[Y = \beta_0 + \beta_1 x\]
\[Y = \beta_0 + \beta_1 x + \beta_2 x^2\]
\[Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\]
\[Y = ???\]
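Each model in this sequence can be fit by least squares, and each contains the previous one as a special case (set the extra coefficient to zero), so the training error can only decrease as the degree grows. A minimal sketch on a small made-up dataset:

```python
import numpy as np

# Made-up dataset (not the slide's data)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.2, 1.9, 3.1, 3.0, 4.2, 4.1, 5.3])

train_mse = {}
for degree in (1, 2, 3):
    coefs = np.polyfit(x, y, degree)   # least-squares polynomial fit
    y_hat = np.polyval(coefs, x)
    train_mse[degree] = np.mean((y - y_hat) ** 2)
    print(degree, round(train_mse[degree], 4))
```

The printed training MSE never increases with the degree, which is exactly why training error alone cannot tell us which model to pick.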
As we have seen in the examples above, there are many options for estimating \(f(X)\).
Some methods are very flexible; some are not. Why would we ever choose a less flexible model?
Not too simple, but not too complex!
Suppose we fit a model \(\hat{f}(x)\) to some training data \(\text{Tr} = \{x_i, y_i\}_{i=1}^{n}\), and we wish to see how well it performs.
We could compute the average squared prediction error over Tr, i.e. \[\text{MSE}_{Tr} = \frac{1}{n} \sum_{i \in Tr} \left\{y_i - \hat{f}(x_i) \right\}^2\]
This is usually biased toward more complex models: a flexible enough \(\hat{f}\) can drive the training error arbitrarily close to zero without predicting well on new data.
Instead we should, if possible, compute it using “fresh” test data \(\text{Te} = \{x_i^\star, y_i^\star\}_{i = 1}^{m}\), i.e. \[\text{MSE}_{Te} = \frac{1}{m} \sum_{i \in Te} \left\{y_i^\star - \hat{f}(x_i^\star) \right\}^2\]
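On simulated data (an assumed setup, for illustration only) we can watch \(\text{MSE}_{Tr}\) and \(\text{MSE}_{Te}\) separate as flexibility grows: here the true \(f\) is a quadratic, and the test set consists of fresh draws from the same process:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    """Draw (x, y) from a noisy quadratic; the 'true' f is known here."""
    x = rng.uniform(0, 1, n)
    y = 1 + 2 * x - 3 * x**2 + rng.normal(scale=0.2, size=n)
    return x, y

x_tr, y_tr = simulate(30)    # training set Tr
x_te, y_te = simulate(200)   # "fresh" test set Te

mse_tr, mse_te = {}, {}
for degree in (1, 2, 12):
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse_tr[degree] = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te[degree] = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_tr[degree]:.4f}, "
          f"test MSE {mse_te[degree]:.4f}")
```

Training MSE is guaranteed not to increase with the degree; the test MSE is what reveals when extra flexibility has stopped paying off.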
Bias: how far, on average, the predicted values are from the true values.
Variance: how “sensitive” the predicted values are; that is, how much they would change if the model were refit to new data.
More to come in this course!
Typically as the flexibility (complexity) of \(f(x)\) increases, its variance increases, and its bias decreases.
To control this behaviour, we should penalize models that are too complex.
How to quantify the degree of this penalty? Based on average test error, not on training error.
There is a natural tradeoff between the bias and the variance of a model.
Today we saw: regression examples, a classification example, how to measure accuracy, and the bias-variance tradeoff.
Next time: